Noisy Channel Models for Corrupted Chinese Text Restoration and GB-to-Big5 Conversion
Abstract
In this article, we propose a noisy channel/information restoration model for error recovery problems in Chinese natural language processing (NLP). A language processing system is treated as an information restoration process operating on the output of a noisy channel. Feeding a large-scale standard corpus C into a simulated noisy channel yields a noisy version of the corpus, N. Using N as the input to the language processing system (i.e., the information restoration process) produces the output C'. An automatic evaluation module then compares the original corpus C with the output C' and computes the performance index (i.e., accuracy). The proposed model has been applied to two common and important Chinese NLP problems on the Internet: corrupted Chinese text restoration and GB-to-Big5 conversion. Sinica Corpus versions 1.0 and 2.0 are used in the experiments. The results show that the proposed model is useful and practical.
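The evaluation loop described in the abstract (C into the channel, N into the restoration system, C' compared with C) can be sketched roughly as follows. This is a toy illustration, not the authors' implementation: the four-character corpus, the mapping tables, and the function names `noisy_channel`, `restore`, `accuracy`, and `evaluate` are all assumptions made for this example. It models one reading of the GB-to-Big5 task, in which a many-to-one Traditional-to-Simplified mapping acts as the channel and restoration must pick the right Traditional character back.

```python
# Toy many-to-one channel: Traditional (Big5-side) -> Simplified (GB-side).
# Both 髮 (hair) and 發 (emit) collapse to the same Simplified character 发.
TO_SIMPLIFIED = {"頭": "头", "髮": "发", "發": "发", "現": "现"}

# Hypothetical restoration tables, hand-written for this example only.
DEFAULT_RESTORE = {"头": "頭", "发": "發", "现": "現"}
# (previous noisy char, current noisy char) -> clean char
CONTEXT_RESTORE = {("头", "发"): "髮"}


def noisy_channel(clean):
    """Simulate the channel: map each character of C to its noisy form in N."""
    return "".join(TO_SIMPLIFIED.get(ch, ch) for ch in clean)


def restore(noisy, use_context):
    """Restore N to C', optionally using the previous noisy character as context."""
    out = []
    prev = ""
    for ch in noisy:
        guess = DEFAULT_RESTORE.get(ch, ch)
        if use_context:
            guess = CONTEXT_RESTORE.get((prev, ch), guess)
        out.append(guess)
        prev = ch
    return "".join(out)


def accuracy(clean, restored):
    """Character-level accuracy of C' against C (equal lengths assumed)."""
    if not clean or len(clean) != len(restored):
        raise ValueError("corpora must be non-empty and of equal length")
    matches = sum(a == b for a, b in zip(clean, restored))
    return matches / len(clean)


def evaluate(clean, use_context):
    """Run C -> N -> C' and score C' against C automatically."""
    noisy = noisy_channel(clean)
    restored = restore(noisy, use_context)
    return accuracy(clean, restored)


if __name__ == "__main__":
    corpus = "頭髮發現"  # "hair" + "discover"
    print(evaluate(corpus, use_context=False))
    print(evaluate(corpus, use_context=True))
```

Because the clean corpus is known in advance, no manual annotation of the system output is needed: any restoration strategy can be dropped into `restore` and scored the same way.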
Related resources
HHMM-based Chinese Lexical Analyzer ICTCLAS
This document presents the results from the Institute of Computing Technology, CAS, in the ACL-SIGHAN-sponsored First International Chinese Word Segmentation Bakeoff. The authors introduce the unified HHMM-based framework of their Chinese lexical analyzer ICTCLAS and explain its operation in the six tracks. They then provide the evaluation results and further analysis. Evaluation on ICTCLAS shows that its performance ...
Text Image Restoration using Adaptive Fuzzy Median Based on 3D Tensors and Iterative Voting
This paper addresses the problem of efficient and effective restoration of text images, by formulating the problem as inferring the surface from a sparse and noisy point set in a 3D structure tensor space. Given a set of noisy data correspondence in corrupted images, the proposed method extracts good matches and rejects the noisy elements. The methodology is unconventional, since, unlike most o...
A Unified Approach to Transliteration-based Text Input with Online Spelling Correction
This paper presents an integrated, end-to-end approach to online spelling correction for text input. Online spelling correction refers to the spelling correction as you type, as opposed to post-editing. The online scenario is particularly important for languages that routinely use transliteration-based text input methods, such as Chinese and Japanese, because the desired target characters canno...
The Postprocessing of Optical Character Recognition Based on Statistical Noisy Channel and Language Model
Image processing techniques have long been used in optical character recognition (OCR). Recognition methods have evolved from early "pattern recognition" to, more recently, "feature extraction", and the recognition rate has risen from 70% to 90%. However, character-by-character recognition has its limitations. Using language models to assist the OCR system in improving recognition ...
A New Information Retrieval Method Suited to Texts Produced by Speech Recognition
This article introduces a pre-processing method for retrieval over speech-recognized text. The inputs are a text corpus generated by a speech recognition system and a query; the task is to search these documents for the query and find the relevant ones. A basic problem with typical speech-recognized text is that a certain percentage of it is misrecognized. This results erroneously ass...
Journal title:
- IJCLCLP
Volume 3
Publication date: 1998